Stock market prices are notoriously difficult to model, but recent advances in machine learning algorithms provide renewed possibilities for accurately modeling market performance. One notable addition to modern machine learning is Natural Language Processing (NLP). For those modeling a specific stock, performing NLP feature extraction and analysis on the collection of news headlines, shareholder documents, or social media postings that mention the company can provide additional information about the human and social elements of market behavior. These insights cannot be captured by historical price data and technical indicators alone.
President Donald J. Trump is one of the most prolific users of social media, specifically Twitter, using it as a direct messaging channel to his followers and avoiding the traditional filtering and restriction that normally control the public influence of the President of the United States. An additional element of the presidency that Trump has avoided is financial transparency and the divesting of assets, which is historically done in order to avoid conflicts of interest, apparent or actual. The president is also known to target companies directly with his tweets, advocating for specific changes/decisions by the company, or simply airing his grievances. This leads to a natural question: how much influence does President Trump exert over the financial markets?
To explore this question, we built multiple types of models, using the S&P 500 as our market index. First, we built a classification model to predict the change in stock price 60 minutes after a tweet. We trained Word2Vec embeddings on President Trump's tweets since his election, which we used as the embedding layer for LSTM and GRU neural networks.
We next built a baseline time series regression model, using historical price data alone to predict price by trading hour. We then built upon this, adding several technical indicators of market performance as additional features. Finally, we combined the predictions of our classification model with several other metrics about the tweets (sentiment scores, # of retweets/favorites, upper-to-lowercase ratio, etc.) to see if combining all of these sources of information could explain even more of the variance in stock market prices.
Can the Twitter activity of Donald Trump explain fluctuations in the stock market?
We will use a combination of traditional stock market forecasting combined with Natural Language Processing and word embeddings from President Trump's tweets to predict fluctuations in the stock market (using S&P 500 as index).
Question 1: Can we predict if stock prices will go up or down at a fixed time point, based on the language in Trump's tweets?
Question 2: How well can we explain stock market fluctuations using only historical price data?
Trained Word2Vec embeddings on a collection of Donald Trump's tweets.
Classified tweets based on change in stock price (delta_price)
NOTE: This model's predictions will become a feature in our final model.
Model 1: Use price alone to forecast hourly price.
Model 2: Use price combined with technical indicators.
* LSTM neural network
Delta-Stock-Price NLP Models
Stock-Market-Forecasting
All Donald Trump tweets from 12/01/2016 (pre-inauguration) through the end of 08/23/2018.
Minute-resolution data for the S&P 500 covering the same time period.
## IMPORT CUSTOM CAPSTONE FUNCTIONS
import functions_combined_BEST as ji
import functions_io as io
from functions_combined_BEST import ihelp, ihelp_menu,\
reload, inspect_variables
## IMPORT MY PUBLISHED PYPI PACKAGE
import bs_ds as bs
from bs_ds.imports import *
## IMPORT CONVENIENCE/DISPLAY FUNCTIONS
from pprint import pprint
import qgrid
import json
import ipywidgets as widgets
# Import plotly and cufflinks for iplots
import plotly
import cufflinks as cf
from plotly import graph_objs as go
from plotly.offline import iplot
cf.go_offline()
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')
#Set pd.set_options for tweet visibility
pd.set_option('display.max_colwidth',100)
pd.set_option('display.max_columns',50)
## Saving the sys.stdout to restore later
import sys
__stdout__=sys.stdout
file_dict = io.def_filename_dictionary(load_prior=False, save_directory=True)
from functions_combined_BEST import ihelp_menu2
# file_dict = ji.load_filename_directory()
np.random.seed(42)
To prepare Donald Trump's tweets for modeling, it is essential to preprocess the text and simplify its contents.
Version 1 of the tweet processing removes retweet tags, hashtags, and mentions, as well as any urls in the tweet. The resulting data column is referred to here as "content_min_clean".
Version 2 additionally removes stopwords; the resulting data column is referred to here as "cleaned_stopped_content".
Both versions are tokenized with the regexp expression "([a-zA-Z]+(?:'[a-z]+)?)", which allows for words such as "can't" that contain an apostrophe in the middle of the word; for Versions 1 and 2 the resulting tokens were put back into sentence form. Version 3 keeps the tweets in their regexp-tokenized form and is referred to as "cleaned_stopped_tokens".
Version 4 reduces all of the tweets to their word lemmas ("cleaned_stopped_lemmas"), further aiding the algorithms in learning the meaning of the texts.
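The cleaning steps above are performed by `ji.full_twitter_df_processing`; as a minimal, self-contained sketch of the first three versions (using only the standard library, a deliberately tiny illustrative stopword list, and simplified removal rules, none of which are the notebook's actual configuration):

```python
import re

# the regexp pattern used in the notebook: words, optionally with an internal apostrophe
TOKEN_PATTERN = r"([a-zA-Z]+(?:'[a-z]+)?)"

# tiny illustrative stopword list; the notebook builds a much larger one
STOPWORDS = {'the', 'a', 'to', 'of', 'rt', 'amp', 'https', 'co'}

def clean_tweet_versions(tweet):
    """Return progressively simplified versions of a tweet, as described above."""
    # Version 1 ('content_min_clean'): simplified here to stripping retweet tags and urls
    v1 = re.sub(r'http\S+|RT\s@\w+:?', '', tweet).strip()
    # regexp-tokenize, lowercase, and drop stopwords
    tokens = [t.lower() for t in re.findall(TOKEN_PATTERN, v1)]
    stopped = [t for t in tokens if t not in STOPWORDS]
    # Version 2 ('cleaned_stopped_content'): tokens rejoined into sentence form
    v2 = ' '.join(stopped)
    # Version 3 ('cleaned_stopped_tokens'): kept in tokenized form
    v3 = stopped
    # Version 4 ('cleaned_stopped_lemmas') would additionally lemmatize each token,
    # e.g. with nltk.stem.WordNetLemmatizer().lemmatize(t)
    return v1, v2, v3

tweet = "RT @example: The tariffs are working! https://t.co/abc123"
v1, v2, v3 = clean_tweet_versions(tweet)
```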
reload(ji)
func_list = [ji.load_raw_twitter_file,
ji.make_stopwords_list,
ji.full_twitter_df_processing,
ji.full_sentiment_analysis]
ji.ihelp_menu(func_list)
ji.save_ihelp_menu_to_file(func_list,filename='_twitter_processing')
## Load in raw csv of twitter_data, create date_time_index, rename columns
raw_tweets = file_dict['twitter_df']['raw_tweet_file']
twitter_df = ji.load_raw_twitter_file(filename=raw_tweets,
date_as_index=True,
rename_map={'text': 'content',
'created_at': 'date'})
## Create list of stopwords for twitter processing
stop_words = ji.make_stopwords_list(incl_punc=True, incl_nums=True,
add_custom=['http','https',
'...','…','``',
'co','“','“','’','‘','”',
"n't","''",'u','s',"'s",
'|','\\|','amp',"i'm","mr"])
## Process twitter data:
# 1. create minimally cleaned column `content_min_clean` with urls removed
twitter_df = ji.full_twitter_df_processing(twitter_df,
raw_tweet_col='content',
name_for_cleaned_tweet_col='content_cleaned',
name_for_stopped_col='cleaned_stopped_content',
name_for_tokenzied_stopped_col='cleaned_stopped_tokens',
use_col_for_case_ratio=None,
use_col_for_sentiment='content_min_clean',
RT=True, urls=True, hashtags=True, mentions=True,
str_tags_mentions=True,
stopwords_list=stop_words, force=False)
## Display Index information
ji.index_report(twitter_df,label='twitter_df')
## Check for strings that exceed the correct tweet length
keep_idx = ji.check_length_string_column(twitter_df, 'content_min_clean',length_cutoff=400,display_describe=False)
## verify no issues arise.
if keep_idx.isna().sum()>0:
    raise Exception('Null values found in keep_idx; inspect before filtering.')
else:
twitter_df=twitter_df[keep_idx]
print(f'removed {np.sum(keep_idx==False)}')
ji.check_length_string_column(twitter_df, 'content_min_clean',length_cutoff=400,return_keep_idx=False)
twitter_df.head(2)
## Search all tweets for occurrences of a specific word
word = 'fed'
idx_word_tweets = ji.search_for_tweets_with_word(twitter_df, word=word,
display_n=5, from_column='content',
return_index=True, display_df=True)
func_list = [ji.load_raw_stock_data_from_txt,
ji.set_timeindex_freq,
ji.load_twitter_df_stock_price]
ji.ihelp_menu(func_list)
ji.save_ihelp_menu_to_file(func_list,'_stock_data_to_twitter_data')
print(f"[i] # number of tweets = {twitter_df.shape[0]}")
## add stock_price for twitter_df
null_ratio = ji.check_null_small(twitter_df,null_index_column='case_ratio')
print(f'[!] {len(null_ratio)} null values for "case_ratio" are tweets containing only urls. Dropping...')
twitter_df.dropna(subset=['is_retweet','case_ratio'],inplace=True)
print(f"[i] New # of tweets = {twitter_df.shape[0]}\n")
twitter_df = ji.load_twitter_df_stock_price(twitter_df,
get_stock_prices_per_tweet=True,
price_mins_after_tweet=60)
ji.index_report(twitter_df);
idx_null_delta = ji.check_null_small(twitter_df,null_index_column='delta_price');
print(f"[!] {len(idx_null_delta)} null values for 'delta_price' were off-hour tweets,\
more than 1 day before the market reopened. Dropping...")
twitter_df.dropna(subset=['delta_price'], inplace=True)
print(f"\n[i] Final # of tweets = {twitter_df.shape[0]}")
ji.column_report(twitter_df,as_df=True)
## Examine delta_price
print("CURRENT # OF POSITIVE AND NEGATIVE PRICE DELTAS:")
print(twitter_df['delta_price_class'].value_counts())
## Examining Changes to classes if use a "No Change" cutoff of $0.05
delta_price = twitter_df['delta_price']
small_pos = [0 < x < .05 for x in delta_price]
small_neg = [-.05 < x < 0 for x in delta_price]
print('\nCHANGES TO CLASSES IF USING A THRESHOLD OF $0.05:\n','---'*12)
print(f'# Positive Delta -> "No Change" = {np.sum(small_pos)}')
print(f'# Negative Delta -> "No Change" = {np.sum(small_neg)}')
print(f'# of Unchanged Classifications = {len(delta_price)-(np.sum(small_pos)+np.sum(small_neg))}')
## BIN DELTA PRICE CLASS
bins = pd.IntervalIndex.from_tuples([ (-np.inf,-.05), (-.05,.05), (.05,np.inf)], closed='left')
## Save indexer column for 'delta_price'
twitter_df['indexer'] = bins.get_indexer(twitter_df['delta_price'])
# get_indexer returns -1 for values outside all bins; map -1 to NaN
mapper = {-1: np.nan, 0: 0, 1: 1, 2: 2}
# remap string classes
mapper2 = {0:'neg', 1:'no_change',2:'pos'}
## Use indexer to map new integer values
twitter_df['delta_price_class_int']= twitter_df['indexer'].apply(lambda x: mapper[x])
twitter_df['delta_price_class'] = twitter_df['delta_price_class_int'].apply(lambda x: mapper2[x])
## Verify mapping of string and integer classes
res1 = pd.DataFrame(twitter_df['delta_price_class'].value_counts())
res2 = pd.DataFrame(twitter_df['delta_price_class_int'].value_counts())
bs.display_side_by_side(res1,res2)
ji.plotly_price_histogram(twitter_df,show_fig=True,as_figure=False)
ji.plotly_pie_chart(twitter_df, column_to_plot='delta_price_class',show_fig=True, as_figure=False)
nlp_df = twitter_df.loc[twitter_df['delta_price_class']!='no_change'].copy()
nlp_df.dropna(inplace=True)
nlp_df.head(2)
# Generate wordclouds
twitter_df_groups,twitter_group_text = ji.get_group_texts_for_word_cloud(nlp_df,
text_column='cleaned_stopped_lemmas',
groupby_column='delta_price_class')
ji.compare_word_clouds(text1=twitter_df_groups['pos']['joined'],
label1='Stock Market Increased',
text2= twitter_df_groups['neg']['joined'],
label2='Stock Market Decreased',
twitter_shaped = True, verbose=1,
suptitle_y_loc=0.75,
suptitle_text='Most Frequent Words by Stock Price +/- Change',
wordcloud_cfg_dict={'collocations':True},
save_file=True,filepath_folder='',
png_filename=file_dict['nlp_figures']['word_clouds_compare'],
**{'subplot_titles_fontdict':{'fontsize':26,'fontweight':'bold'},
'suptitle_fontdict':{'fontsize':40,'fontweight':'bold'},
'group_colors':{'group1':'green','group2':'red'},
});
## Comparing words ONLY unique to each group
df_pos_words, df_neg_words = ji.compare_freq_dists_unique_words(text1=twitter_df_groups['pos']['text_tokens'],
label1='Price Increased',
text2=twitter_df_groups['neg']['text_tokens'],
label2='Price Decreased',
top_n=20, display_dfs=True,
return_as_dicts=False)
pos_freq_dict, neg_freq_dict = ji.compare_freq_dists_unique_words(text1=twitter_df_groups['pos']['text_tokens'],
label1='Price Increased',
text2=twitter_df_groups['neg']['text_tokens'],
label2='Price Decreased',
top_n=20, display_dfs=False,
return_as_dicts=True)
## WORDCLOUD OF WORDS UNIQUE TO TWEETS THAT INCREASED VS DECREASED STOCK PRICE
ji.compare_word_clouds(text1= pos_freq_dict,label1='Stock Price Increased',
text2=neg_freq_dict, label2='Stock Price Decreased',
twitter_shaped=True, from_freq_dicts=True,
suptitle_y_loc=0.75,wordcloud_cfg_dict={'collocations':True},
suptitle_text='Words Unique to Stock Price +/- Change',
save_file=True,filepath_folder='',
png_filename=file_dict['nlp_figures']['word_clouds_compare_unique'],
**{'subplot_titles_fontdict':
{'fontsize':26,
'fontweight':'bold'},
'suptitle_fontdict':{
'fontsize':40,
'fontweight':'bold'},
'group_colors':{
'group1':'green','group2':'red'}
});
ji.make_tweet_bigrams_by_group(twitter_df_groups)
func_list = [ji.make_word2vec_model,ji.get_wv_from_word2vec,
ji.get_w2v_kwargs,ji.Word2vecParams]
ihelp_menu(func_list)
ji.save_ihelp_menu_to_file(func_list,'_word2vec')
## Loading custom class for tracking Word2Vec parameters
w2vParams = ji.Word2vecParams()
w2vParams.params_template()
## FITTING WORD2VEC AND TOKENIZER
params = {
'text_column': 'cleaned_stopped_lemmas',
'window':3,
'min_count':2,
'epochs':10,
'sg':0,
'hs':1,
'negative':0,
'ns_exponent':0.0
}
model_kwds= ji.get_w2v_kwargs(params)
# text_data = twitter_df[params['text_column']]
## using df_tokenize for full body of a text for word2vec
word2vec_model = ji.make_word2vec_model(twitter_df,
text_column = params['text_column'],
window = params['window'],
min_count= params['min_count'],
epochs = params['epochs'],
verbose=1,
return_full=True,
**model_kwds)
w2vParams.append(params)
wv = word2vec_model.wv
### USING WORD VECTOR MATH TO GET A FEEL FOR QUALITY OF MODEL
wv = word2vec_model.wv
def V(string,wv=wv):
return wv.get_vector(string)
def equals(vector,wv=wv):
return wv.similar_by_vector(vector)
list_of_equations = ["V('republican')-V('honor')",
"V('man')+V('power')",
"V('russia')+V('honor')",
"V('china')+V('tariff')",
"V('trump')+V('lie')"]
for eqn in list_of_equations:
print(f'\n* {eqn} =')
res = eval(f"equals({eqn})")
[print('\t',x) for x in res]
import functions_io as io
io.save_word2vec(word2vec_model,file_dict,parms_dict=w2vParams.last_params)
## Select smaller subset of twitter_df for df_tokenize
columns_for_model_0 = ['delta_price_class','delta_price','pre_tweet_price',
'post_tweet_price','delta_time','B_ts_rounded','B_ts_post_tweet','content',
'content_min_clean','cleaned_stopped_content','cleaned_stopped_tokens',
'cleaned_stopped_lemmas','delta_price_class_int']
df_tokenize=twitter_df[columns_for_model_0].copy()
ji.check_class_balance(df_tokenize,'delta_price_class_int',as_raw=True, as_percent=False)
ji.check_class_balance(df_tokenize,'delta_price_class',as_raw=False)
ji.save_ihelp_to_file(ji.undersample_df_to_match_classes)
ihelp_menu([ji.undersample_df_to_match_classes])
## RESTRICTING TIME DELTAS FOR MODEL
remove_delta_time_tweets=True
## RESAMPLING
undersample_to_match_classes = True
class_column='delta_price_class'
class_list_to_keep = None # None=all classes or ['neg','pos']
## Display results
show_tweet_versions = True
print('[0] INITIAL CLASS COUNTS.')
## Print initial class balance
ji.check_class_balance(df_tokenize,col=class_column);
## REMOVE TWEETS BASED ON TIME BETWEEN TWEET AND STOCK PRICE VALUE
if remove_delta_time_tweets:
## SAMPLE ONLY TWEETS WITHIN 1 DAY OF STOCK MARKET PRICE DATA
df_sampled = df_tokenize.loc[df_tokenize['delta_time']<'1 day']
print(f"[1] # OF DAYS REMOVED BY 'delta_time' = {df_tokenize.shape[0]-df_sampled.shape[0]}")
ji.check_class_balance(df_sampled, col=class_column, as_raw=True, as_percent=False)
else:
print('[1] Skipping removing tweets by time_delta')
df_sampled = df_tokenize
## UNDERSAMPLE FROM UNBALANCED CLASSES
if undersample_to_match_classes:
## Print status
if class_list_to_keep is None:
print_class_list= list(df_sampled[class_column].unique())
else:
print_class_list = class_list_to_keep
print(f'[2] RESAMPLING DF TO MATCH SMALLEST CLASS.\n\tBalancing: {print_class_list}')
## RESAMPLE TO MATCH CLASSES
df_sampled = ji.undersample_df_to_match_classes(df_sampled,
class_column=class_column,
class_values_to_keep=class_list_to_keep,verbose=0)
ji.check_class_balance(df_sampled,col=class_column, as_percent=False)
else:
print('\n[2] Skipping balancing classes and keeping all 3 classes.')
## Display final output
dash = '---'*20
print(f"\n\n [i] Final class balance:")
ji.check_class_balance(df_sampled,col=class_column)
display(df_sampled.head(2))
show_tweet_versions=True
if show_tweet_versions:
ji.display_same_tweet_diff_cols(df_sampled,
columns = ['content' ,'content_min_clean',
'cleaned_stopped_content',
'cleaned_stopped_tokens',
'cleaned_stopped_lemmas'],as_md=True)
ji.check_class_balance(df_sampled)
text_data = df_sampled['cleaned_stopped_lemmas']
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import text, sequence
from keras.utils import to_categorical
## prepare y
# Changed for class imbalance #
y = to_categorical(df_sampled['delta_price_class_int'],num_classes=3)
wv = ji.get_wv_from_word2vec(word2vec_model)
tokenizer = Tokenizer(num_words=len(wv.vocab))
## FIGURE OUT WHICH VERSION TO USE WITH SERIES:
tokenizer.fit_on_texts(text_data)
# return integer-encoded sentences
X = tokenizer.texts_to_sequences(text_data)
X = sequence.pad_sequences(X)
## Save word indices
word_index = tokenizer.index_word
reverse_index = {v:k for k,v in word_index.items()}
## Get training/test split
X_train, X_test, y_train, y_test = ji.train_test_val_split(X, y, test_size=0.15, val_size=0)
# ji.check_y_class_balance(data=[y_train,y_test])
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
# del X
func_list = [ji.make_keras_embedding_layer]
ihelp_menu(func_list)
ji.save_ihelp_to_file(func_list[0])
from keras import callbacks, models, layers, optimizers, regularizers
early_stop = callbacks.EarlyStopping(monitor='loss',mode='min',patience=5,min_delta=.001,verbose=2)
callbacks=[early_stop]
## Make model infrastructure:
model0 = models.Sequential()
## Get and add embedding_layer
embedding_layer = ji.make_keras_embedding_layer(wv, X_train)
model0.add(embedding_layer)
# model0.add(layers.SpatialDropout1D(0.2))
model0.add(layers.Bidirectional(layers.LSTM(units=100, return_sequences=False,
dropout=0.3,recurrent_dropout=0.3,
kernel_regularizer=regularizers.l2(.01))))
model0.add(layers.Dense(3, activation='softmax'))
model0.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
model0.summary()
ihelp_menu(ji.evaluate_classification)
ji.save_ihelp_to_file(ji.evaluate_classification)
## set params
num_epochs = 10
validation_split = 0.2
clock = bs.Clock()
clock.tic()
dashes = '---'*20
print(f"{dashes}\n\tFITTING MODEL:\n{dashes}")
history0 = model0.fit(X_train, y_train,
epochs=num_epochs,
verbose=True,
validation_split=validation_split,
batch_size=300,
callbacks=callbacks)
clock.toc()
cm_fname = file_dict['model_0A']['fig_conf_mat.ext']
hist_fname = file_dict['model_0A']['fig_keras_history.ext']
summary_fname = file_dict['model_0A']['model_summary']
df_class_report0A,fig0A=ji.evaluate_classification(model0,history0,
X_train, X_test,
y_train, y_test,
report_as_df=False,
binary_classes=False,
conf_matrix_classes=['Decrease','No Change','Increase'],
normalize_conf_matrix=True,
save_history=True, history_filename=hist_fname,
save_conf_matrix_png=True, conf_mat_filename=cm_fname,
save_summary=True,summary_filename=summary_fname)
save_me_as_model_0A=True
save_me_as_pred_nlp = False
ji.reload(ji)
if save_me_as_pred_nlp:
model_key='nlp_model_for_predictions'
elif save_me_as_model_0A:
model_key='model_0A'
filename = file_dict[model_key]['base_filename']
nlp_files = ji.save_model_weights_params(model0,check_if_exists=True,auto_increment_name=True,
auto_filename_suffix=True,filename_prefix=filename)
file_dict[model_key]['output_filenames'] = nlp_files
ji.update_file_directory(file_dict)
Our model had difficulty classifying tweets by delta_price, but did perform better than chance (36% accuracy vs. 33% chance). We will next attempt to use another type of recurrent neural network layer, the Gated Recurrent Unit (GRU).
## GRU Model
from keras import models, layers, optimizers, regularizers
model0B = models.Sequential()
## Get and add embedding_layer
embedding_layer = ji.make_keras_embedding_layer(wv, X_train)
model0B.add(embedding_layer)
model0B.add(layers.SpatialDropout1D(0.3))
model0B.add(layers.GRU(units=100, dropout=0.3, recurrent_dropout=0.2,return_sequences=True))
model0B.add(layers.GRU(units=100, dropout=0.3, recurrent_dropout=0.2))
# model0.add(layers.Dense(units=50, activation='relu'))#, activation='tan' # activation='relu'))#removed 08/21
model0B.add(layers.Dense(3, activation='softmax'))
model0B.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
model0B.summary()
num_epochs = 10
clock = bs.Clock()
clock.tic()
historyB = model0B.fit(X_train, y_train, epochs=num_epochs, verbose=True,
                       validation_split=0.1, batch_size=300)
clock.toc()
model_key = "model_0B"
cm_fname = file_dict[model_key]['fig_conf_mat.ext']
hist_fname = file_dict[model_key]['fig_keras_history.ext']
summary_fname = file_dict[model_key]['model_summary']
df_class_report0B, fig0B = ji.evaluate_classification(model0B, historyB,
X_train, X_test, y_train,y_test,report_as_df=False,
conf_matrix_classes=['Decrease','No Change','Increase'],
binary_classes=False, normalize_conf_matrix=True,
save_history=True, history_filename=hist_fname,
save_conf_matrix_png=True, conf_mat_filename=cm_fname,
save_summary=True,summary_filename=summary_fname)
save_me_as_model_0B=True
save_me_as_pred_nlp = False
ji.reload(ji)
if save_me_as_pred_nlp:
model_key='nlp_model_for_predictions'
elif save_me_as_model_0B:
model_key='model_0B'
filename = file_dict[model_key]['base_filename']
nlp_files = ji.save_model_weights_params(model0B,check_if_exists=True,auto_increment_name=True,
auto_filename_suffix=True,filename_prefix=filename)
file_dict[model_key]['output_filenames'] = nlp_files
ji.update_file_directory(file_dict)
# ji.dict_dropdown(file_dict)
The GRU performed better than the LSTM model, with 39% validation accuracy.
ji.inspect_variables(locals(),show_how_to_delete=False)
del_me= ['one_hot_results','nlp_df','text_data']#list of variable names
for me in del_me:
try:
exec(f'del {me}')
print(f'del {me} succeeded')
except:
print(f'del {me} failed')
continue
# DISPLAY CODE TO BE USED BELOW TO LOAD AND PROCESS STOCK DATA
functions_used=[ji.load_processed_stock_data, # This script combines the original 4 used:
ji.load_raw_stock_data_from_txt,
ji.set_timeindex_freq,ji.custom_BH_freq,
ji.get_technical_indicators]
ji.ihelp_menu(functions_used)
ji.save_ihelp_menu_to_file(functions_used,'_stock_df_processing')
Because we are predicting a continuous price (a regression task), accuracy is not an appropriate metric for judging model performance. Instead, we use Theil's U, which compares our forecast errors against those of a naive "no change" forecast.
Theil's U:
| Theil's U Value | Interpretation |
|---|---|
| < 1 | Forecasting is better than guessing |
| 1 | Forecasting is about as good as guessing |
| > 1 | Forecasting is worse than guessing |
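The notebook computes this metric inside its helper functions; a minimal sketch of the Theil's U2 statistic, assuming one-step-ahead forecasts aligned index-for-index with the true series, could look like:

```python
import numpy as np

def theils_u(y_true, y_pred):
    """Theil's U2: ratio of the model's relative forecast errors to those of a
    naive 'no change' forecast. U < 1 means the model beats naive guessing."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # relative squared error of the model's one-step-ahead forecasts
    model_err = ((y_pred[1:] - y_true[1:]) / y_true[:-1]) ** 2
    # relative squared error of the naive forecast (next value == current value)
    naive_err = ((y_true[1:] - y_true[:-1]) / y_true[:-1]) ** 2
    return np.sqrt(model_err.sum() / naive_err.sum())
```

By construction, feeding in the naive forecast itself yields U = 1, and a perfect forecast yields U = 0, matching the interpretation table above.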
fname = file_dict['stock_df']['raw_csv_file']
raw_stock_df = ji.load_raw_stock_data_from_txt(filename = fname, verbose=2)
fig = ji.plotly_time_series(raw_stock_df, y_col='BidClose',as_figure=True)
stock_df = ji.get_technical_indicators(raw_stock_df,make_price_from='BidClose')
del raw_stock_df
# SELECT DESIRED COLUMNS
stock_df = stock_df[[
'price','ma7','ma21','26ema','12ema','MACD','20sd',
'upper_band','lower_band','ema','momentum']]
# Make stock_price for twitter functions
stock_df.dropna(inplace=True)
ji.index_report(stock_df)
display(stock_df.head(3))
func_list = [ji.train_test_split_by_last_days,
ji.make_scaler_library,
ji.transform_cols_from_library,
ji.make_train_test_series_gens]
ihelp_menu(func_list)
ji.save_ihelp_menu_to_file(func_list,'_stock_data_prep_for_modeling')
## SPECIFY # OF TRAINING TEST DAYS
num_test_days=10
num_train_days= 260
### SPECIFY Number of days included in each X_sequence (each prediction)
days_for_x_window=1
# Calculate number of rows to bin for x_windows
periods_per_day = ji.get_day_window_size_from_freq( stock_df, ji.custom_BH_freq() )
## Get the number of rows for x_window
x_window = periods_per_day * days_for_x_window
print(f'X_window size = {x_window} -- ({days_for_x_window} day(s) * {periods_per_day} rows/day)\n')
## Train-test-split by the # of days
df_train, df_test = ji.train_test_split_by_last_days(stock_df,
periods_per_day =periods_per_day,
num_test_days = num_test_days,
num_train_days = num_train_days,
verbose=1, iplot=True)
###### RESCALE DATA USING MinMaxScalers FIT ON TRAINING DATA's COLUMNS ######
display(df_train.head(2).style.set_caption('df_train - pre-scaling'))
scaler_library, df_train = ji.make_scaler_library(df_train, transform=True, verbose=1)
df_test = ji.transform_cols_from_library(df_test, col_list=None,
scaler_library=scaler_library,
inverse=False)
display(df_train.head(2).style.set_caption('df_train - post-scaling'))
# Show transformed dataset
# display( df_train.head(2).round(3).style.set_caption('training data - scaled'))
# Create timeseries generators
train_generator, test_generator = ji.make_train_test_series_gens(
df_train['price'], df_test['price'],
x_window=x_window,n_features=1,batch_size=1, verbose=0)
from keras.models import Sequential
from keras import optimizers
from keras.layers import Bidirectional, Dense, LSTM, Dropout
from keras.regularizers import l2
# Specify input shape: (timesteps per sample, features per timestep)
n_input = x_window
n_features = 1 # just stock Price
print(f'input shape: ({n_input},{n_features})')
input_shape=(n_input, n_features)
# Create model architecture
model1 = Sequential()
model1.add(LSTM(units=50, input_shape =input_shape,return_sequences=True))#,kernel_regularizer=l2(0.01),recurrent_regularizer=l2(0.01),
model1.add(LSTM(units=50, activation='relu'))
model1.add(Dense(1))
model1.compile(loss=ji.my_rmse, metrics=['acc'],
optimizer=optimizers.Nadam())
display(model1.summary())
## FIT MODEL
dashes = '---'*20
print(f"{dashes}\n\tFITTING MODEL:\n{dashes}")
## set params
epochs=5
# override keras warnings
ji.quiet_mode(True,True,True)
# Instantiating clock timer
clock = bs.Clock()
clock.tic('')
# Fit the model
history = model1.fit_generator(train_generator,
epochs=epochs,
verbose=2,
use_multiprocessing=True,
workers=3)
clock.toc('')
model_key = "model_1"
hist_fname = file_dict[model_key]['fig_keras_history.ext']
summary_fname = file_dict[model_key]['model_summary']
# eval_results = ji.evaluate_model_plot_history(model1, train_generator, test_generator)
ji.evaluate_regression_model(model1,history,
train_generator=train_generator,
test_generator=test_generator,
true_test_series=df_test['price'],
true_train_series =df_train['price'],
save_history=True,history_filename=hist_fname,
save_summary=True, summary_filename=summary_fname)
## Get true vs pred data as a dataframe and iplot
df_model1 = ji.get_model_preds_df(model1,
test_generator = test_generator,
true_train_series = df_train['price'],
true_test_series = df_test['price'],
include_train_data=True,
inverse_tf = True,
scaler = scaler_library['price'],
preds_from_gen = True,
preds_from_train_preds = True,
preds_from_test_preds = True,
iplot = True, iplot_title='Model 1: True Vs Predicted S&P 500 Price',
verbose=0)
# Get evaluation metrics
df_results1, dfs_results1, df_shifted1 =\
ji.compare_eval_metrics_for_shifts(df_model1['true_test_price'],
df_model1['pred_from_gen'],
shift_list=np.arange(-4,4,1),
true_train_series_to_add=df_model1['true_train_price'],
display_results=True,
display_U_info=True,
return_results=True,
return_styled_df=True,
return_shifted_df=True)
save_model=True
ji.save_model_dfs(file_dict, 'model_1',df_model1,dfs_results1,df_shifted1)
filename_prefix = file_dict['model_1']['base_filename']
if save_model ==True:
model_1_output_files = ji.save_model_weights_params(model1,
filename_prefix=filename_prefix,
auto_increment_name=True,
auto_filename_suffix=True,
suffix_time_format='%m-%d-%y_%I%M%p',
save_model_layer_config_xlsx=True)
# SELECT DESIRED COLUMNS
stock_df = stock_df[[
'price','ma7','ma21','26ema','12ema','MACD','20sd',
'upper_band','lower_band','ema','momentum']]
# Make stock_price for twitter functions
stock_df.dropna(inplace=True)
ji.index_report(stock_df)
display(stock_df.head(3))
fig =ji.plotly_technical_indicators(stock_df,figsize=(900,500))
df['ma7'] = df['price'].rolling(window=7).mean()   # window of 7 if daily data
df['ma21'] = df['price'].rolling(window=21).mean() # window of 21 if daily data
Moving Average Convergence Divergence (MACD) is a trend-following momentum indicator that shows the relationship between two moving averages of a security's price. The MACD is calculated by subtracting the 26-period Exponential Moving Average (EMA) from the 12-period EMA.
The result of that calculation is the MACD line. A nine-day EMA of the MACD, called the "signal line," is then plotted on top of the MACD line, which can function as a trigger for buy and sell signals.
Traders may buy the security when the MACD crosses above its signal line and sell - or short - the security when the MACD crosses below the signal line. Moving Average Convergence Divergence (MACD) indicators can be interpreted in several ways, but the more common methods are crossovers, divergences, and rapid rises/falls. - from Investopedia
# pd.ewma was removed from pandas; use the .ewm() accessor instead
df['26ema'] = df['price'].ewm(span=26).mean()
df['12ema'] = df['price'].ewm(span=12).mean()
df['MACD'] = (df['12ema'] - df['26ema'])
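The nine-day "signal line" described above is not computed in the snippet; a hedged sketch with an illustrative (not the notebook's) price series might be:

```python
import pandas as pd

# illustrative rising price series; the notebook uses the S&P 500 'price' column
df = pd.DataFrame({'price': [float(p) for p in range(100, 160)]})

df['12ema'] = df['price'].ewm(span=12).mean()
df['26ema'] = df['price'].ewm(span=26).mean()
df['MACD'] = df['12ema'] - df['26ema']
# nine-period EMA of the MACD line, i.e. the 'signal line' described above
df['signal'] = df['MACD'].ewm(span=9).mean()
# crossover rule: MACD above its signal line is read as a buy signal
df['bullish_crossover'] = df['MACD'] > df['signal']
```

For a steadily rising series the shorter EMA leads the longer one, so MACD stays positive and sits above its lagging signal line.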
Exponentially weighted moving average
dataset['ema'] = dataset['price'].ewm(com=0.5).mean()
Bollinger bands
"Bollinger Bands® are a popular technical indicator used by traders in all markets, including stocks, futures and currencies. There are a number of uses for Bollinger Bands®, including determining overbought and oversold levels, as a trend following tool, and monitoring for breakouts. There are also some pitfalls of the indicator." Bollinger Bands are composed of three lines. One of the more common calculations uses a 20-day simple moving average (SMA) for the middle band. The upper band is calculated by adding twice the daily standard deviation to the middle band; the lower band subtracts twice the daily standard deviation from it. - from Investopedia
# Create Bollinger Bands
# pd.stats.moments.rolling_std was removed from pandas; use .rolling() instead
dataset['20sd'] = dataset['price'].rolling(window=20).std()
dataset['upper_band'] = dataset['ma21'] + (dataset['20sd']*2)
dataset['lower_band'] = dataset['ma21'] - (dataset['20sd']*2)
Momentum
"Momentum is the rate of acceleration of a security's price or volume – that is, the speed at which the price is changing. Simply put, it refers to the rate of change on price movements for a particular asset and is usually defined as a rate. In technical analysis, momentum is considered an oscillator and is used to help identify trend lines." - from Investopedia
# Create Momentum
dataset['momentum'] = dataset['price']-1
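Note that the snippet above follows its source and uses a simple offset (`price - 1`). A momentum measure that matches the "rate of change" definition quoted above could be sketched as follows (illustrative series; the look-back period `n` is a hypothetical choice, not from the notebook):

```python
import pandas as pd

# illustrative price series; the notebook applies this to the S&P 500 'price' column
price = pd.Series([100., 102., 101., 105., 108.])

n = 1  # hypothetical look-back period
# percent rate of change over n periods
momentum_roc = price.diff(n) / price.shift(n) * 100
```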
## SPECIFY # OF TRAINING TEST DAYS
num_test_days=20
num_train_days=260
### SPECIFY Number of days included in each X_sequence (each prediction)
days_for_x_window=1
# Calculate number of rows to bin for x_windows
periods_per_day = ji.get_day_window_size_from_freq( stock_df, ji.custom_BH_freq() )
## Get the number of rows for x_window
x_window = periods_per_day * days_for_x_window
print(f'X_window size = {x_window} -- ({days_for_x_window} day(s) * {periods_per_day} rows/day)\n')
## Train-test-split by the # of days
df_train, df_test = ji.train_test_split_by_last_days(stock_df,
periods_per_day =periods_per_day,
num_test_days = num_test_days,
num_train_days = num_train_days,
verbose=1, iplot=True)
###### RESCALE DATA USING MinMaxScalers FIT ON TRAINING DATA's COLUMNS ######
display(df_train.head(2).style.set_caption('df_train - pre-scaling'))
scaler_library, df_train = ji.make_scaler_library(df_train, transform=True, verbose=1)
df_test = ji.transform_cols_from_library(df_test, col_list=None,
scaler_library=scaler_library,
inverse=False)
display(df_train.head(2).style.set_caption('df_train - post-scaling'))
# Show transformed dataset
# display( df_train.head(2).round(3).style.set_caption('training data - scaled'))
# Create timeseries generators
train_generator, test_generator = ji.make_train_test_series_gens(
df_train['price'], df_test['price'],
x_window=x_window,n_features=1,batch_size=1, verbose=0)
## Make new time series generators with all stock_indicators for X_sequences
train_generator, test_generator = ji.make_train_test_series_gens(
train_data_series=df_train,
test_data_series=df_test,
y_cols='price',
x_window=x_window,
n_features=len(df_train.columns),
batch_size=1, verbose=1)
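The generator's job is to slide an `x_window`-row window over the features and pair each window with the next target value. A minimal sketch of that windowing (what keras' `TimeseriesGenerator` does internally), on toy data:

```python
import numpy as np

def make_windows(data, y, x_window):
    """Pair each target y[t] with the x_window rows of features before t."""
    X_seq, y_seq = [], []
    for t in range(x_window, len(data)):
        X_seq.append(data[t - x_window:t])
        y_seq.append(y[t])
    return np.array(X_seq), np.array(y_seq)

data = np.arange(20, dtype=float).reshape(10, 2)   # 10 rows, 2 features
y = data[:, 0]                                      # predict the first column
X_seq, y_seq = make_windows(data, y, x_window=3)
# X_seq has shape (samples, x_window, n_features) -- the LSTM's expected input
```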
# Create keras model from model_params
import functions_combined_BEST as ji
from keras.models import Sequential
from keras.layers import Bidirectional, Dense, LSTM, Dropout
from IPython.display import display
from keras.regularizers import l2
# Specify input shape: (timesteps per window, number of features)
n_input = x_window
n_features = len(df_train.columns) # Using stock_price and technical indicators
print(f'input shape: ({n_input},{n_features})')
input_shape=(n_input, n_features)
# Create model architecture
model2 = Sequential()
model2.add(LSTM(units=50, input_shape =input_shape,return_sequences=True))#, kernel_regularizer=l2(0.01),recurrent_regularizer=l2(0.01),
# model2.add(Dropout(0.2))
model2.add(LSTM(units=50, activation='relu'))
model2.add(Dense(1))
from keras import optimizers  # Nadam optimizer (missing from the imports above)
model2.compile(loss=ji.my_rmse, metrics=['acc',ji.my_rmse],
               optimizer=optimizers.Nadam())
display(model2.summary())
epochs=5
clock = bs.Clock()
print('---'*20)
print('\tFITTING MODEL:')
print('---'*20,'\n')
# start the timer
clock.tic('')
# Fit the model
history = model2.fit_generator(train_generator,epochs=epochs)
clock.toc('')
model_key = "model_2"
hist_fname = file_dict[model_key]['fig_keras_history.ext']
summary_fname = file_dict[model_key]['model_summary']
# eval_results = ji.evaluate_model_plot_history(model1, train_generator, test_generator)
ji.evaluate_regression_model(model2,history,
train_generator=train_generator,
test_generator=test_generator,
true_test_series=df_test['price'],
true_train_series =df_train['price'],
save_history=True,history_filename=hist_fname,
save_summary=True, summary_filename=summary_fname)
## Get true vs pred data as a dataframe and iplot
df_model2 = ji.get_model_preds_df(model2,
test_generator=test_generator,
true_train_series = df_train['price'],
true_test_series = df_test['price'],
x_window=x_window,
n_features=len(df_train.columns),
scaler=scaler_library['price'],
preds_from_gen=True,
inverse_tf=True,
iplot=True, iplot_title='Model 2: True Vs Predicted S&P 500 Price')
# Compare predictions if predictions timebins shifted
df_results2, dfs_results2, df_shifted2 =\
ji.compare_eval_metrics_for_shifts(df_model2['true_test_price'],
df_model2['pred_from_gen'],
shift_list=np.arange(-4,5,1),
true_train_series_to_add=df_model2['true_train_price'],
display_results=True,
return_styled_df=True,
display_U_info=False,
return_shifted_df=True,
return_results=True)
##SAVING DFS
ji.save_model_dfs(file_dict,'model_2',
df_model=df_model2,
df_results=dfs_results2,
df_shifted=df_shifted2)
This means our second model can explain __% of the variance in the data ($R^2$) and that it performed significantly better than guessing (Theil's U value < 1.0).
It is surprising that the model's performance remains this poor even after adding the technical indicators.
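Theil's U is computed inside the `ji` helpers rather than shown explicitly; one common formulation, in which U < 1 means the forecast beats a naive "no change" forecast, is sketched below (the exact formula the helper uses may differ):

```python
import numpy as np

def theils_u(true, pred):
    """Theil's U: ratio of forecast RMSE to the RMSE of a naive
    'no change' forecast. U < 1.0 means better than naive guessing."""
    true = np.asarray(true, dtype=float)
    pred = np.asarray(pred, dtype=float)
    num = np.sqrt(np.mean((pred[1:] - true[1:]) ** 2))
    den = np.sqrt(np.mean((true[1:] - true[:-1]) ** 2))
    return num / den

true = np.array([100., 101., 103., 102., 104.])
pred_naive = np.concatenate([[true[0]], true[:-1]])  # predict "no change"
```

A perfect forecast gives U = 0, and the naive forecast gives U = 1 by construction.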
# LOAD IN FULL STOCK DATASET using ClosingBig S&P500 WITH INDEX.FREQ=CBH
fname = file_dict['stock_df']['stock_df_with_indicators']
full_df = ji.load_processed_stock_data(processed_data_filename=fname)
# SELECT DESIRED COLUMNS
stock_df = full_df[[
'price','ma7','ma21','26ema','12ema','MACD',
'20sd','upper_band','lower_band','ema','momentum'
]].copy()  # .copy() avoids a SettingWithCopyWarning when date_time is added below
stock_df.head()
stock_df['date_time'] = stock_df.index.to_series()
ji.index_report(stock_df)
stock_df.sort_index(inplace=True)
display(stock_df.head(2),stock_df.tail(2))
del full_df
## LOAD IN RAW TWITTER DATA, NO PROCESSING
twitter_df= ji.load_raw_twitter_file(filename='data/trumptwitterarchive_export_iphone_only__08_23_2019.csv',
date_as_index=True,
rename_map={'text': 'content', 'created_at': 'date'})
twitter_df = ji.check_twitter_df(twitter_df,text_col='content',remove_duplicates=True, remove_long_strings=True)
# MAKE TIME INTERVALS BASED ON BUSINESS HOUR START (09:30-10:30)
time_intervals= \
ji.make_time_index_intervals(stock_df,
col='date_time',
closed='right',
return_interval_dicts=False)
## USE THE TIME INDEX TO FILTER OUT TWEETS FROM THE HOUR PRIOR
twitter_df, bin_codes = ji.bin_df_by_date_intervals(twitter_df ,time_intervals)
stock_df, bin_codes_stock = ji.bin_df_by_date_intervals(stock_df, time_intervals, column='date_time')
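The interval-binning helpers are project-specific, but the core idea can be sketched with pandas' `IntervalIndex` (toy timestamps and trading hours, purely for illustration):

```python
import pandas as pd

# Hourly trading-hour boundaries for one day (toy example)
hours = pd.date_range('2019-08-01 09:30', '2019-08-01 16:30', freq='h')
intervals = pd.IntervalIndex.from_breaks(hours, closed='right')

# Each tweet gets the integer code of the interval that contains it,
# analogous to the notebook's `int_bins` column (-1 = outside trading hours)
tweet_times = pd.to_datetime(['2019-08-01 09:45', '2019-08-01 13:10'])
codes = intervals.get_indexer(tweet_times)
```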
## COLLAPSE DFs BY CODED BINS
twitter_grouped = ji.collapse_df_by_group_index_col(twitter_df,
group_index_col='int_bins',
drop_orig=True,
verbose=0)
stocks_grouped = ji.collapse_df_by_group_index_col(stock_df,
drop_orig=True,
group_index_col='int_bins',
verbose=0)
display(twitter_grouped.head(2),stocks_grouped.head(2))
ihelp_menu(ji.merge_stocks_and_tweets)
## STOCKS AND TWEETS
df_combined = ji.merge_stocks_and_tweets(stocks_grouped,
twitter_grouped,
on='int_bins',how='left',
show_summary=False)
ji.column_report(df_combined, as_df=True)
## Check for and address new null values
ji.check_null_small(df_combined);
cols_to_fill_zeros = ['num_tweets','total_retweet_count','total_favorite_count']
for col in cols_to_fill_zeros:
idx_null = ji.find_null_idx(df_combined, column=col)
df_combined.loc[idx_null,col] = 0
cols_to_fill_blank_str = ['group_content','source','tweet_times','is_retweet']
for col in cols_to_fill_blank_str:
idx_null = ji.find_null_idx(df_combined, column=col)
df_combined.loc[idx_null, col] = ""
ji.check_null_small(df_combined);
fname = file_dict['df_combined']['pre_nlp']
df_combined.to_csv(fname)
## Add nlp
df_nlp = ji.full_twitter_df_processing(df_combined,'group_content',force=True)
ji.column_report(df_nlp, as_df=True)
## Use case ratio null values as index to replace values
idx_null= ji.check_null_small(df_nlp,null_index_column='case_ratio')
df_nlp.loc[idx_null,'case_ratio'] = 0.0
ji.check_null_small(df_nlp)
## replace sentiment_class, set =-1
cols_to_replace_misleading_values = ['sentiment_class']
for col in cols_to_replace_misleading_values:
df_nlp.loc[idx_null,col] = -1
## remap sentiment class
sent_class_mapper = {'neg':0, -1:1, 'pos':2}
df_nlp['sentiment_class'] = df_nlp['sentiment_class'].apply(lambda x: sent_class_mapper[x])
bool_cols_to_ints = ['has_tweets']
for col in bool_cols_to_ints:
df_nlp[col] = df_nlp[col].apply(lambda x: 1 if x==True else 0)
ji.display_same_tweet_diff_cols(df_nlp.groupby('has_tweets').get_group(True),
columns=['group_content','content_min_clean','cleaned_stopped_lemmas'],as_md=True)
ji.check_twitter_df(df_nlp,char_limit=61*350)
# get_floats = df_nlp['content_min_clean'].apply(lambda x: isinstance(x,float))
fname =file_dict['df_combined']['post_nlp']
df_nlp.to_csv(fname)
# print(f'saved to {fname}')
def get_most_recent_filenames(full_filename, str_to_find=None):
    """Return files in full_filename's folder sorted newest-first,
    optionally filtered to names containing str_to_find."""
    import os
    import time
    folder = '/'.join(full_filename.split('/')[0:-1])
    mtimes = [['file', 'date modified']]
    for file in os.listdir(folder):
        if (str_to_find is None) or (str_to_find in file):
            mtimes.append([file, time.ctime(os.path.getmtime(folder + '/' + file))])
    res = bs.list2df(mtimes)
    res['date modified'] = pd.to_datetime(res['date modified'])
    res.set_index('date modified', inplace=True)
    res.sort_index(ascending=False, inplace=True)
    return res
## Load the nlp model and weights with layers set trainable=False
base_fname = file_dict['nlp_model_for_predictions']['base_filename']
nlp_model,df_model_layers = ji.load_model_weights_params(base_filename= base_fname,#'models/NLP/nlp_model0B__09-02-2019_0121pm',
load_model_params=False,
load_model_layers_excel=True,
trainable=False)
## Load in Word2Vec model from earlier
w2v_model = io.load_word2vec(file_dict=file_dict)
ihelp_menu([ji.get_tokenizer_and_text_sequences,
ji.replace_embedding_layer])
## GET X_SEQUENES FOR BINNED TWEETS AND CREATE NEW EMBEDDING LAYER FOR THEIR SIZE
text_data=df_nlp['cleaned_stopped_lemmas']
tokenizer, X_sequences = ji.get_tokenizer_and_text_sequences(w2v_model,text_data)
new_nlp_model = ji.replace_embedding_layer(nlp_model,w2v_model,text_data,verbose=2)
new_nlp_model.summary()
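`ji.replace_embedding_layer` presumably rebuilds the embedding matrix for the new vocabulary. The standard construction (with toy word vectors and vocabulary standing in for the trained Word2Vec model and keras tokenizer, purely an assumption about the helper's internals) looks like:

```python
import numpy as np

# Hypothetical stand-ins for the Word2Vec vectors and tokenizer vocabulary
word_vectors = {'market': np.array([0.1, 0.2]), 'tweet': np.array([0.3, 0.4])}
word_index = {'market': 1, 'tweet': 2}   # keras tokenizers reserve index 0

embedding_dim = 2
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
for word, i in word_index.items():
    if word in word_vectors:           # out-of-vocabulary words stay all-zero
        embedding_matrix[i] = word_vectors[word]
```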
## GET PREDICTIONS FROM NEW MODEL
preds = new_nlp_model.predict_classes(X_sequences)
print(type(preds), preds.shape)
ji.check_y_class_balance(preds)
## add to df
df_nlp['pred_classes_int'] = preds
mapper= {0:'neg', 1:'no_change', 2:'pos'}
df_nlp['pred_classes'] = df_nlp['pred_classes_int'].apply(lambda x: mapper[x])
display(df_nlp.head())
df_combined = df_nlp
model_col_list = ['price', 'ma7', 'ma21', '26ema', '12ema', 'MACD', '20sd', 'upper_band','lower_band', 'ema', 'momentum',
'has_tweets','num_tweets','case_ratio', 'compound_score','pos','neu','neg','sentiment_class',
'pred_classes','pred_classes_int','total_favorite_count','total_retweet_count']
df_combined = ji.set_timeindex_freq(df_combined,fill_nulls=False)
df_to_model = df_combined[model_col_list].copy()
df_to_model.head(2)
## SPECIFY # OF TRAINING TEST DAYS
num_test_days=20
num_train_days=260
### SPECIFY Number of days included in each X_sequence (each prediction)
days_for_x_window=2
cols_to_exclude = ['pred_classes','has_tweets']
# Calculate number of rows to bin for x_windows
periods_per_day = ji.get_day_window_size_from_freq(df_to_model.drop(cols_to_exclude,axis=1), ji.custom_BH_freq() )
## Get the number of rows for x_window
x_window = periods_per_day * days_for_x_window#data_params['days_for_x_window']
print(f'X_window size = {x_window} -- ({days_for_x_window} day(s) * {periods_per_day} rows/day)\n')
## Train-test-split by the # of days
df_train, df_test = ji.train_test_split_by_last_days(df_to_model.drop(cols_to_exclude,axis=1),
periods_per_day =periods_per_day,
num_test_days = num_test_days,
num_train_days = num_train_days,
verbose=1, iplot=True)
###### RESCALE DATA USING MinMaxScalers FIT ON TRAINING DATA's COLUMNS ######
display(df_train.head(2).style.set_caption('df_train - pre-scaling'))
scaler_library, df_train = ji.make_scaler_library(df_train, transform=True, verbose=1)
df_test = ji.transform_cols_from_library(df_test, col_list=None,
scaler_library=scaler_library,
inverse=False)
display(df_train.head(2).style.set_caption('df_train - post-scaling'))
# Show transformed dataset
# display( df_train.head(2).round(3).style.set_caption('training data - scaled'))
# Create timeseries generators
train_generator, test_generator = ji.make_train_test_series_gens(
train_data_series=df_train,
test_data_series=df_test,
y_cols='price',
x_window=x_window,
n_features=len(df_train.columns),
batch_size=1, verbose=1)
from keras.models import Sequential
from keras import optimizers
from keras.layers import Bidirectional, Dense, LSTM, Dropout
from IPython.display import display
from keras.regularizers import l2
# Specify input shape: (timesteps per window, number of features)
n_input =x_window
n_features = len(df_train.columns)
print(f'input shape: ({n_input},{n_features})')
input_shape=(n_input, n_features)
# Create model architecture
model3 = Sequential()
model3.add(LSTM(units=100, input_shape =input_shape,return_sequences=True,dropout=0.3,recurrent_dropout=0.3))#, kernel_regularizer=l2(0.01),recurrent_regularizer=l2(0.01),
model3.add(LSTM(units=100, activation='relu', return_sequences=False,dropout=0.3,recurrent_dropout=0.3))
model3.add(Dense(1))
model3.compile(loss=ji.my_rmse, metrics=['acc'],optimizer=optimizers.Nadam())
model3.summary()
## FIT MODEL
dashes = '---'*20
print(f"{dashes}\n\tFITTING MODEL:\n{dashes}")
## set params
epochs=5
# override keras warnings
ji.quiet_mode(True,True,True)
# Instantiating clock timer
clock = bs.Clock()
clock.tic('')
# Fit the model
history = model3.fit_generator(train_generator,
epochs=epochs,
verbose=2,
use_multiprocessing=True,
workers=3)
clock.toc('')
model_key = "model_3"
hist_fname = file_dict[model_key]['fig_keras_history.ext']
summary_fname = file_dict[model_key]['model_summary']
# eval_results = ji.evaluate_model_plot_history(model1, train_generator, test_generator)
ji.evaluate_regression_model(model3,history,
train_generator=train_generator,
test_generator=test_generator,
true_test_series=df_test['price'],
true_train_series =df_train['price'],
save_history=True,history_filename=hist_fname,
save_summary=True, summary_filename=summary_fname)
### PREFERRED WORKFLOW: build df_model first, then derive the evaluation metrics from it
## Get true vs pred data as a dataframe and iplot
df_model3 = ji.get_model_preds_df(model3,
test_generator = test_generator,
true_train_series = df_train['price'],
true_test_series = df_test['price'],
include_train_data=True,
inverse_tf = True,
scaler = scaler_library['price'],
preds_from_gen = True,
iplot = False,
verbose=1)
ji.plotly_true_vs_preds_subplots(df_model3,title='Model 3: True Vs Predicted S&P 500 Price')
# Get evaluation metrics
df_results3, dfs_results3, df_shifted3 =\
ji.compare_eval_metrics_for_shifts(df_model3['true_test_price'],
df_model3['pred_from_gen'],
shift_list=np.arange(-4,4,1),
true_train_series_to_add=df_model3['true_train_price'],
display_results=True,
display_U_info=True,
return_results=True,
return_styled_df=True,
return_shifted_df=True)
save_model=True
ji.save_model_dfs(file_dict, 'model_3',df_model3,dfs_results3,df_shifted3)
filename_prefix = file_dict['model_3']['base_filename']
if save_model:
model_3_output_files = bs.save_model_weights_params(model3,
filename_prefix=filename_prefix,
auto_increment_name=True,
auto_filename_suffix=True,
suffix_time_format='%m-%d-%y_%I%M%p',
save_model_layer_config_xlsx=True)
This means our third model can explain __% of the variance in the data ($R^2$) and that it performed significantly better than guessing (Theil's U value < 1.0).
It is surprising that the model's performance remains poor even with the additional tweet-derived features.
## SPECIFY # OF TRAINING TEST DAYS
reload(ji)
num_test_days=20
num_train_days=2*52*5
### SPECIFY Number of days included in each X_sequence (each prediction)
days_for_x_window=1
cols_to_exclude = ['pred_classes','has_tweets']
# Calculate number of rows to bin for x_windows
periods_per_day = ji.get_day_window_size_from_freq(df_to_model.drop(cols_to_exclude,axis=1), ji.custom_BH_freq() )
## Get the number of rows for x_window
x_window = periods_per_day * days_for_x_window#data_params['days_for_x_window']
print(f'X_window size = {x_window} -- ({days_for_x_window} day(s) * {periods_per_day} rows/day)\n')
## Train-test-split by the # of days
df_train, df_test = ji.train_test_split_by_last_days(df_to_model.drop(cols_to_exclude,axis=1),
periods_per_day =periods_per_day,
num_test_days = num_test_days,
num_train_days = num_train_days,
verbose=1, iplot=True)
###### RESCALE DATA USING MinMaxScalers FIT ON TRAINING DATA's COLUMNS ######
display(df_train.head(2).style.set_caption('df_train - pre-scaling'))
scaler_library, df_train = ji.make_scaler_library(df_train, transform=True, verbose=1)
df_test = ji.transform_cols_from_library(df_test, col_list=None,
scaler_library=scaler_library,
inverse=False)
display(df_train.head(2).style.set_caption('df_train - post-scaling'))
## Shift price values such that the y-value being predicted is the following hour's Closing Price
df_train['price_shifted'] = df_train['price'].shift(-1)
df_test['price_shifted'] = df_test['price'].shift(-1)
display(df_train[['price','price_shifted','momentum','ema','num_tweets',]].head(10))
# Drop the couple of null values created by the shift
df_train.dropna(subset=['price_shifted'], inplace=True)
df_test.dropna(subset=['price_shifted'], inplace=True)
## Drop columns and make train-test-X and y
target_col = 'price_shifted'
drop_cols = ['price_shifted','price']
X_train = df_train.drop(drop_cols,axis=1)
y_train = df_train[target_col]
X_test = df_test.drop(drop_cols,axis=1)
y_test = df_test[target_col]
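The `shift(-1)` target construction is easy to verify on a toy frame: each row's features are paired with the next period's closing price.

```python
import pandas as pd

# Toy frame: features at time t predict the price at time t+1
demo = pd.DataFrame({'price': [100., 101., 102., 103.],
                     'momentum': [0., 1., 1., 1.]})
demo['price_shifted'] = demo['price'].shift(-1)
demo = demo.dropna(subset=['price_shifted'])  # last row has no "next" price

X_demo = demo.drop(['price', 'price_shifted'], axis=1)
y_demo = demo['price_shifted']
```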
import xgboost as xgb
from xgboost import plot_importance, plot_tree
from sklearn.metrics import mean_squared_error, mean_absolute_error
clock = bs.Clock()
clock.tic('')
reg = xgb.XGBRegressor(n_estimators=1000,silent=False,max_depth=4)
reg.fit(X_train, y_train,
eval_set=[(X_train, y_train), (X_test, y_test)],
early_stopping_rounds=50,
verbose=False)
## Get Predictions
pred_price = reg.predict(X_test)
pred_price_series = pd.Series(pred_price,index=df_test.index,name='pred_test_price')#.plot()
df_xgb = pd.concat([df_train['price'].rename('true_train_price'), pred_price_series,df_test['price'].rename('true_test_price')],axis=1)
df_results = ji.evaluate_regression(df_test['price'], pred_price_series,show_results=True);
clock.toc('')
fig = ji.plotly_true_vs_preds_subplots(df_xgb,true_train_col='true_train_price',
true_test_col='true_test_price',
pred_test_columns='pred_test_price',
title='Model X: True Vs Predicted S&P 500 Price')
## PLOT FEATURE IMPORTANCE
feature_importance={}
for import_type in ['weight','gain','cover']:
reg.importance_type = import_type
cur_importances = reg.feature_importances_
feature_importance[import_type] = pd.Series(data = cur_importances,
index=df_train.drop(drop_cols,axis=1).columns,
name=import_type)
df_importance = pd.DataFrame(feature_importance)
importance_fig = df_importance.sort_values(by='weight', ascending=True).iplot(kind='barh',theme='solar',
title='Feature Importance',
xTitle='Relative Importance<br>(sum=1.0)',
asFigure=True)
iplot(importance_fig)
# Compare predictions if predictions timebins shifted
df_resultsX, dfs_resultsX, df_shiftedX =\
ji.compare_eval_metrics_for_shifts(df_xgb['true_test_price'],
df_xgb['pred_test_price'],
shift_list=np.arange(-4,5,1),
true_train_series_to_add=df_xgb['true_train_price'],
display_results=True,
return_styled_df=True,
display_U_info=False,
return_shifted_df=True,
return_results=True)
df_importance.to_csv('results/modelxgb/df_importance.csv')
ji.save_model_dfs(file_dict, 'model_xgb',df_xgb,dfs_resultsX,df_shiftedX)
tree_vis = xgb.to_graphviz(reg)
tree_vis.render("xgb_full_model_",format="png",)
Using stock price plus technical indicators and the tweet-derived features, with a prediction time-shift of -1, we achieved a Theil's $U$ value of 0.33 (reported metrics: 0.0077 and 0.3254).
This means our XGBoost model can explain 98% of the variance in the data ($R^2$) and performed remarkably better than guessing (Theil's $U$ < 1.0).
It is surprising how good the XGBoost results are in comparison to the prior models.
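The "98% of the variance" claim corresponds to $R^2 = 0.98$. For reference, $R^2$ can be computed directly (this matches sklearn's `r2_score` definition; the toy numbers below are illustrative, not the model's output):

```python
import numpy as np

def r_squared(true, pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    true = np.asarray(true, dtype=float)
    pred = np.asarray(pred, dtype=float)
    ss_res = np.sum((true - pred) ** 2)
    ss_tot = np.sum((true - true.mean()) ** 2)
    return 1 - ss_res / ss_tot

true = [1., 2., 3., 4.]
pred = [1.1, 1.9, 3.2, 3.8]
score = r_squared(true, pred)
```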
dfs_list = {'Model 1':dfs_results1,
'Model 2':dfs_results2,
'Model 3':dfs_results3,
'XGB Regressor':dfs_resultsX}
for k,v in dfs_list.items():
new_cap = f'Evaluation Metrics for {k}'
display(v.set_caption(new_cap))